Causal inference with observational data

Causal inference, Causality, Homophily and influence

Many methods aim to emulate controlled experiments.

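Matching (statistics) methods try to approximate a Randomized controlled trial with blocked assignments, assuming that treatment is exogenous once units are matched on the observed covariates. There are fundamental challenges, such as the Balance sample size frontier: pruning poorly matched units improves balance but shrinks the sample.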
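To make the blocking idea concrete, here is a minimal sketch of exact matching on a single discrete covariate. The data, the block labels, and the function names (blocked_effect, naive_effect) are all hypothetical and chosen only for illustration. It assumes treatment is as good as random within each block, and it prunes any block that lacks either treated or control units, which is the balance versus sample size trade-off in miniature.

```python
from collections import defaultdict


def mean(values):
    return sum(values) / len(values)


def naive_effect(units):
    """Difference in mean outcome, treated minus control, ignoring blocks."""
    treated = [y for _, t, y in units if t]
    control = [y for _, t, y in units if not t]
    return mean(treated) - mean(control)


def blocked_effect(units):
    """Exact matching on a discrete block.

    units: list of (block, treated, outcome) tuples.
    Returns (estimate, n_pruned), where the estimate is the size-weighted
    average of within-block differences in means, and n_pruned counts units
    in blocks that have no treated or no control units.
    """
    blocks = defaultdict(lambda: {True: [], False: []})
    for block, treated, outcome in units:
        blocks[block][bool(treated)].append(outcome)

    weighted_sum, n_kept, n_pruned = 0.0, 0, 0
    for arms in blocks.values():
        n_block = len(arms[True]) + len(arms[False])
        if not arms[True] or not arms[False]:
            n_pruned += n_block  # no comparison possible inside this block
            continue
        weighted_sum += n_block * (mean(arms[True]) - mean(arms[False]))
        n_kept += n_block

    if n_kept == 0:
        raise ValueError("no block contains both treated and control units")
    return weighted_sum / n_kept, n_pruned


# Hypothetical toy data: older units are treated more often AND have higher
# outcomes, so the naive comparison is confounded by age.
units = [
    ("young", True, 5), ("young", True, 6),
    ("young", False, 3), ("young", False, 4),
    ("young", False, 3), ("young", False, 4),
    ("old", True, 9), ("old", True, 10), ("old", True, 9), ("old", True, 10),
    ("old", False, 7), ("old", False, 8),
    ("middle", True, 7),  # no control counterpart, so this unit is pruned
]

estimate, n_pruned = blocked_effect(units)
print("naive:", naive_effect(units))
print("blocked:", estimate, "pruned units:", n_pruned)
```

In this toy data the within-block difference is 2 in both usable blocks, while the naive comparison is inflated because the treated group is older. Real matching methods handle many, often continuous, covariates and are considerably more involved than this sketch.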

Can bigger data help causal inference? Titiunik2014can points out that the underlying science does not change much with big data, and that big data by itself does not necessarily address the challenges of Causal inference. Hernán2016using suggests ways to emulate a target trial using big data. On the other hand, Eckles2021bias showed that when many observations and variables are available, an observational study can yield estimates very similar to those of a Randomized controlled trial.

Machine learning holds great potential here, although it does not automatically solve the causal inference problem. More data means we can learn better predictive models, which can then be used to produce counterfactuals (see Varian2016causal and Prosperi2020causal). Representation learning can also help by making it possible to represent entities (e.g., patients) as dense vectors.
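One way to read "predictive models that produce counterfactuals" is sketched below: fit a model of the outcome on untreated units only, predict what each treated unit's outcome would have been without treatment, and take the gap between observed and predicted outcomes. This is an illustrative sketch, not the specific procedure of the cited papers. The one-covariate least-squares model, the data, and the function names (fit_line, counterfactual_att) are assumptions made for the example, and the estimate is only meaningful if there is no unobserved confounding and the model holds for treated units.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x (one covariate)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("covariate has no variation")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def counterfactual_att(control, treated):
    """Average effect on the treated via predicted untreated outcomes.

    control, treated: lists of (x, y) pairs.
    The predictive model is trained on control units only, so its predictions
    for treated units stand in for their unobserved untreated outcomes.
    """
    intercept, slope = fit_line([x for x, _ in control], [y for _, y in control])
    gaps = [y - (intercept + slope * x) for x, y in treated]
    return sum(gaps) / len(gaps)


# Hypothetical data: untreated outcome is 1 + 2x, treatment adds 3.
control = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
treated = [(1.5, 7.0), (2.5, 9.0), (3.5, 11.0)]

print("estimated effect on the treated:", counterfactual_att(control, treated))
```

The treated covariate values here sit inside the range covered by the control units, so the model interpolates rather than extrapolates. In practice a flexible learner would replace the straight line, and its quality should be judged on held-out untreated data.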

Although “weak” methods like Propensity score matching cannot match the strength of much stricter Causal inference methods, they can still provide insights. Weak causal power does not make their results worthless. We should keep the hierarchy of evidence in mind, but methods lower in the hierarchy can still capture real effects. Andrew Gelman has a good discussion of this topic: “Arnold Foundation and Vera Institute argue about a study of the effectiveness of college education programs in prison”; see also Eckles2021bias.